Part I - Ford GoBike Data Exploration

by Oluwabamise Omolaso

Introduction

This data set includes information about individual rides (183000+) made in a bike-sharing system covering the greater San Francisco Bay area.

Preliminary Wrangling

There are null values in the data, and the rows containing them will be dropped.

start_time and end_time are stored with the object dtype and will be converted to the datetime format.

There are no duplicates in the data.
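The wrangling steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up sample, not the notebook's actual code; the `rides` DataFrame and its values are assumptions, and only the column names come from the dataset.

```python
import pandas as pd

# Tiny made-up sample standing in for the real ride data (values are illustrative only).
rides = pd.DataFrame({
    "duration_sec": [600, 1200, 300],
    "start_time": ["2019-02-01 00:00:20.636", "2019-02-10 08:15:00.000", "2019-02-28 17:30:00.000"],
    "end_time": ["2019-02-01 00:10:20.636", "2019-02-10 08:35:00.000", "2019-02-28 17:35:00.000"],
    "member_birth_year": [1985.0, None, 1992.0],
})

# Drop every row that has at least one null value.
rides = rides.dropna().reset_index(drop=True)

# Convert the time columns from object (string) dtype to datetime.
for col in ["start_time", "end_time"]:
    rides[col] = pd.to_datetime(rides[col])

# Count fully duplicated rows; 0 means there are no duplicates.
n_duplicates = int(rides.duplicated().sum())
```

On the real data, the same three calls (`dropna`, `pd.to_datetime`, `duplicated`) would be applied to the full table.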

Some rides have a distance of 0km because they started and ended in the same neighborhood, at the same latitude and longitude.
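To see why identical start and end coordinates give exactly 0km, here is a small sketch of one common way to compute the distance between two stations. It assumes the haversine (great-circle) formula; the write-up does not state which formula was used, and the function name and coordinates below are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Same start and end coordinates: the distance is exactly zero.
same_station = haversine_km(37.7896, -122.4008, 37.7896, -122.4008)

# Two nearby but different points: a short, non-zero distance.
different_stations = haversine_km(37.7896, -122.4008, 37.7946, -122.3948)
```

Note that this is a straight-line distance between stations, so a long round trip that returns to its starting station would still be recorded as 0km.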

From the above, setting an upper age limit of 100 reveals outliers in the data, and these will be dropped.

What is the structure of your dataset?

There are initially 183412 records of individual rides in the dataset with 16 features ('duration_sec', 'start_time', 'end_time', 'start_station_id', 'start_station_name', 'start_station_latitude', 'start_station_longitude', 'end_station_id', 'end_station_name', 'end_station_latitude', 'end_station_longitude', 'bike_id', 'user_type', 'member_birth_year', 'member_gender', 'bike_share_for_all_trip'). Feature engineering added seven new columns ('age', 'age_group', 'time_of_day', 'day_of_week', 'duration_mins', 'hour_of_day', 'distance_km'), bringing the total to 23 features. After dropping null values, 174880 records were left.
The time variables were converted to the appropriate datatype (datetime). The rides span from 2019-02-01 00:00:20.636000 (the start time of the first ride) to 2019-03-01 08:01:55.975000 (the end time of the last ride).

user_type, member_gender and bike_share_for_all_trip are categorical variables with the following levels:
user_type: Subscriber, Customer
member_gender: Male, Female, Others
bike_share_for_all_trip: Yes, No
age_group (engineered column): '<18', '18-44', '45-64', '65+'
day_of_week: 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'
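As a rough sketch of how such levels can be declared in pandas, the snippet below uses `pd.CategoricalDtype`. The sample rows are made up, and treating only day_of_week as ordered is an assumption made for illustration.

```python
import pandas as pd

# Made-up rows; only the column names and levels come from the dataset description.
rides = pd.DataFrame({
    "user_type": ["Subscriber", "Customer", "Subscriber"],
    "day_of_week": ["Friday", "Monday", "Sunday"],
})

# Unordered categorical: the levels have no natural ranking.
rides["user_type"] = rides["user_type"].astype(
    pd.CategoricalDtype(categories=["Subscriber", "Customer"])
)

# Ordered categorical: days sort Monday..Sunday instead of alphabetically.
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
rides["day_of_week"] = rides["day_of_week"].astype(
    pd.CategoricalDtype(categories=day_order, ordered=True)
)

sorted_days = rides["day_of_week"].sort_values().astype(str).tolist()
```

Declaring the order once on the column means that sorted tables and plot axes follow the calendar order without extra arguments.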

What is/are the main feature(s) of interest in your dataset?

I am interested in uncovering the best features to understand the bike sharing system:
Time range, age_group, duration, user_type, member_gender, bike_share_for_all_trip, and the locations of the trips.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

The start and end times were broken down into hours of the day and days of the week to understand how trips vary across these periods. All the rides took place in a single year (2019), between February and March.
The user_type, member_gender and bike_share_for_all_trip columns were converted to categorical dtypes to understand the distribution of rides among these groups and whether shared rides differ from individual rides.
Member ages were derived from their birth years, and a new categorical column for age groups was then engineered from the ages.
Some ages were found to be greater than 100, and these records were dropped from the data.
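The age engineering described above might look roughly like this. It is a sketch under two assumptions that are not stated in the write-up: that age is measured relative to 2019 (the year of the rides), and that the groups are built with `pd.cut` using left-closed bins.

```python
import pandas as pd

# Made-up birth years; the last one yields an implausible age above 100.
rides = pd.DataFrame({"member_birth_year": [1985, 2002, 1960, 1950, 1900]})

# Assumption: age is measured relative to 2019, the year of the rides.
rides["age"] = 2019 - rides["member_birth_year"]

# Drop implausible ages above 100.
rides = rides[rides["age"] <= 100].copy()

# right=False makes each bin include its left edge,
# so the bins are [0, 18), [18, 45), [45, 65) and [65, 101).
rides["age_group"] = pd.cut(
    rides["age"],
    bins=[0, 18, 45, 65, 101],
    labels=["<18", "18-44", "45-64", "65+"],
    right=False,
)
```

Filtering before binning keeps the out-of-range ages from ending up as missing values in age_group.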

Univariate Exploration

Questions of interest (Univariate plots):

- What is the distribution of ride trips by user_type ? 
- What is the distribution of ride trips by member_gender ? 
- What is the distribution of ride trips by age_group ? 
- What is the distribution of ride trips by duration_mins ? 
- What is the distribution of ride trips by distance_km ?
- What is the distribution of ride trips by bike_share_for_all_trip ?
- What is the distribution of ride trips by hour_of_day ?
- What is the distribution of ride trips by day_of_week ?
- What is the distribution of start station latitude and longitude coordinates? 
- What is the distribution of end station latitude and longitude coordinates? 

What is the distribution of ride trips by user_type ?

It can be seen that a large proportion of users are subscribers

What is the distribution of ride trips by member_gender ?

The gender with the most ride trips is the male gender.

In line with this, I will investigate further to understand the average trip duration by gender

What is the distribution of ride trips by age_group ?

The younger age groups have the most ride trips, and the count declines at the extremes of age.

We will investigate further to understand the average duration by age_group

What is the distribution of ride trips by duration_mins ?

There is a right skew, with the majority of ride trips lasting less than 200mins, at the lower end of the range.

We will investigate further by first using an axis limit to zoom into the data.

By limiting the ride duration to 60mins, we can see that the majority of ride durations in the dataset are under 1hr.
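The "majority under 1hr" observation can also be checked numerically rather than only by zooming the axis. The sketch below uses made-up durations, so the resulting numbers are illustrative and not the dataset's.

```python
import pandas as pd

# Made-up durations in seconds, with one very long ride to mimic the right skew.
rides = pd.DataFrame({"duration_sec": [300, 660, 900, 3000, 14400]})
rides["duration_mins"] = rides["duration_sec"] / 60

# Fraction of rides shorter than 60 minutes.
share_under_1hr = (rides["duration_mins"] < 60).mean()

# The median is less sensitive to the long right tail than the mean.
median_mins = rides["duration_mins"].median()
mean_mins = rides["duration_mins"].mean()
```

With a right-skewed variable like this one, the mean sits above the median, which is one reason to report both.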

The duration of most trips is between 0 and 2hrs and appears to follow a normal distribution, with few data points extending beyond 2hrs.

From the visualizations above, we have established that the average duration is about 11mins and that trips typically last less than 200mins, so we will deal with the outliers.

There seems to be a mismatch between duration and distance travelled: one would expect longer trips to take more time, but in some cases shorter trips appear to take more time. This will be investigated further under the bivariate plots.

What is the distribution of ride trips by distance_km ?

The distance covered in most trips is apparently less than 10km and also has a right skew

The log of 0 is undefined (it tends to negative infinity) and, as explained earlier, a distance of 0 in the data refers to rides that occurred within the same neighborhood.

There is a steep cut-off because all distance values of 0 were excluded by the log transformation.
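A minimal sketch of handling the zero distances before a log transform is shown below. It assumes a base-10 log and simply filters out the zeros; the write-up does not say which base or which handling was used, and the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up distances in km, including one zero-distance ride.
distance_km = pd.Series([0.0, 0.5, 1.0, 2.5, 10.0])

# The log of 0 is undefined, so keep only strictly positive distances.
positive = distance_km[distance_km > 0]
log_distance = np.log10(positive)

# Number of rides excluded from the log-scaled view.
n_dropped = len(distance_km) - len(positive)
```

Keeping track of `n_dropped` makes it explicit how much of the data is missing from the log-scaled plot.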

From the box plot, we can see one clear outlier in the data beyond 20km, and this can be dropped.

Now the plot looks better

What is the distribution of ride trips by bike_share_for_all_trip ?

The trips are not equally distributed: a large proportion of them are No.

What is the distribution of ride trips by hour_of_day ?

There appear to be two peaks, at 8am and 5pm, which roughly correspond to the beginning and the close of a typical work day.

What is the distribution of ride trips by day_of_week ?

It can be seen that most trips occur on weekdays, with fewer on weekends.
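The hour and weekday counts behind these two plots can be derived from start_time with the pandas `.dt` accessor. The timestamps below are made up for illustration.

```python
import pandas as pd

# Made-up start times within the dataset's date range.
start_time = pd.to_datetime(pd.Series([
    "2019-02-01 08:05:00",  # a Friday
    "2019-02-01 17:40:00",  # a Friday
    "2019-02-02 11:00:00",  # a Saturday
    "2019-02-07 08:30:00",  # a Thursday
]))

# Extract the hour (0-23) and the weekday name from each timestamp.
hour_of_day = start_time.dt.hour
day_of_week = start_time.dt.day_name()

# Trips per hour (sorted by hour) and trips per weekday.
trips_per_hour = hour_of_day.value_counts().sort_index()
trips_per_day = day_of_week.value_counts()
```

These count tables are what a bar chart of trips by hour or by weekday would display.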

What is the distribution of start station latitude and longitude coordinates?

What is the distribution of end station latitude and longitude coordinates?

The majority of the locations fall within a narrow range of start and end latitudes and longitudes, with two distinct regions for the latitude (between 37.3 and 37.4) and the longitude (between -122.4 and -122.2, and around -121.9).

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Distribution of ride trips by user_type - A large proportion of users are subscribers.
Distribution of ride trips by member_gender - The gender with the most ride trips is the male gender.
Distribution of ride trips by age_group - The working-class age groups have the most ride trips, and the count declines at the extremes of age.
Distribution of ride trips by duration_mins - By limiting the ride duration to 60mins, we can see that the majority of ride durations in the dataset are under 1hr.
Distribution of ride trips by distance_km - The distance covered in most trips is apparently less than 10km, and the distribution also has a right skew.
Distribution of ride trips by bike_share_for_all_trip - The trips are not equally distributed, with a large proportion being No.
Distribution of ride trips by hour_of_day - There appear to be two peaks, at 8am and 5pm, which roughly correspond to the beginning and the close of a typical work day.
Distribution of ride trips by day_of_week - Most trips occur on weekdays, with fewer on weekends.
Distribution of start & end station latitude and longitude coordinates - The majority of the locations fall within a narrow range of start and end latitudes and longitudes, with two distinct regions for the latitude (between 37.3 and 37.4) and the longitude (between -122.4 and -122.2, and around -121.9).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Duration (mins) - had an unusual distribution with a right skew, and I had to use an x-axis limit to zoom into the region with the most data points. Most rides last less than 1hr and, after a log transformation, the distribution looks unimodal and approximately normal.

Distance (km) - had an unusual distribution with a right skew, and I had to use an x-axis limit to zoom into the region with the most data points. It appeared fairly unimodal with a right skew and seems to follow the same distribution as duration, except that distances of 0 were cut off by the log transformation. I wonder what the relationship between duration and distance is.

Bivariate Exploration

To begin with, I want to look at the pairwise correlations present between features in the data.

As expected, there is a fairly strong correlation between start station latitudes and longitudes, but the correlation decreases with other variables, and correlations are generally weak across the board. However, it is surprising to see that even distance and duration have only a weak positive correlation.
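A pairwise correlation table of this kind can be produced with `DataFrame.corr`, which computes Pearson correlations by default. The values below are made up, so the resulting coefficients say nothing about the real data.

```python
import pandas as pd

# Made-up numeric features; only the column names mirror the dataset.
rides = pd.DataFrame({
    "duration_mins": [5.0, 11.0, 15.0, 50.0, 8.0],
    "distance_km": [0.8, 1.9, 2.4, 0.0, 1.2],
    "age": [34, 27, 45, 31, 52],
})

# Pearson correlation between every pair of numeric columns.
corr = rides.corr()

# Look up a single pair, for example duration versus distance.
duration_vs_distance = corr.loc["duration_mins", "distance_km"]
```

The resulting square matrix is what a correlation heatmap is typically drawn from.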

It can be seen that start longitude and latitude have similar distributions against the other variables. A large proportion of the distance covered falls within the 20-40 age bracket and at distances below 10km. Age and duration appear clustered close to the y-axis, while age and hour of day appear clustered around the centre. Duration and distance show a similar pattern to age and distance, with visible outliers. Hour of day shows similar patterns with age and distance but is clustered close to the y-axis for duration, as the majority of ride trips were less than 1hr, and visible outliers can be seen.

Explore relationship between distance covered and duration

Looking at the parameters above, there is some mismatch in the data: one would expect longer distances to take more time, but that is not the case here, as there seems to be an inverse relationship where some of the shortest distances take the most time. However, as established earlier, rides that start and end at the same longitude and latitude have a distance of 0, so distance will not be a good predictor here.
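One way to probe this mismatch is to compare the duration-distance correlation with and without the zero-distance rides, and to look at the implied average speed. The sketch below is only a suggestion on made-up numbers; it is not an analysis that was run on the dataset, and `speed_kmh` is a hypothetical helper column.

```python
import pandas as pd

# Made-up rides; the third one is a long ride that ends where it started.
rides = pd.DataFrame({
    "duration_mins": [10.0, 20.0, 120.0, 30.0],
    "distance_km": [2.0, 4.0, 0.0, 6.5],
})

# Correlation over all rides, zero-distance rides included.
corr_all = rides["duration_mins"].corr(rides["distance_km"])

# Keep only rides whose start and end coordinates differ.
moving = rides[rides["distance_km"] > 0].copy()
corr_moving = moving["duration_mins"].corr(moving["distance_km"])

# Implied average speed in km/h; unusually low values can flag detours or stops.
moving["speed_kmh"] = moving["distance_km"] / (moving["duration_mins"] / 60)
```

In this toy sample a single long zero-distance ride is enough to flip the sign of the correlation, which shows why such rides are worth checking before trusting the overall coefficient.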

Now I want to investigate how duration and distance vary with the categorical variables.

From the visualization, it can be seen that there are quite a number of outliers across all categorical variables for distance, and since the majority of trips last less than 1hr it is difficult to tell the groups apart. Subscribers had the higher count in the univariate plot; however, customers cover more distance. Although the male gender was the most prominent by count, the distances covered by the genders did not vary much. The same goes for bike_share_for_all_trip, where the higher count was recorded for Yes but, in terms of distance travelled, those not in the sharing scheme have a higher amount. The distribution of distance covered among age groups is not so surprising, as the majority of rides fall within the working-class age group. The average ride distance for the 60+ group appears to be nearly the same as for the 30-39 group despite a far lower count in the univariate plots, which could mean that older people in this category take longer bike trips or that there are large outliers.

We will take a closer look at the duration plots with the categorical variables

From the visualization above, distance travelled did not seem to vary much among the different categories. However, there is an inverse relationship for those who share bikes for their trips: despite forming the higher proportion by count, they travel a shorter distance. A similar pattern is seen for user_type and member_gender, where customers, with a lower count, travel more distance than subscribers. This could be the effect of outliers.

We can investigate age beyond the age groups by using the riders' actual ages.

As can be seen, the majority of the data lies below age 80, since outliers were dropped earlier, but there are still a few outliers between 80 and 100. These may be the result of inaccurate data entry and can be removed, as anyone beyond 80 years of age is unlikely to be physically fit enough to ride a bike. Besides, the majority of ride durations do not exceed 200mins, and anything beyond that can be seen as an outlier. It is also seen that younger people travel for more hours, which is expected.

From the visualization above, it can be seen that younger people travel more distance than older people, which follows a similar pattern to the one seen for duration above.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The striking observation thus far is in the relationship between distance and duration. One would expect a linear relationship and a strong positive correlation as the time taken is expected to increase as distance increases but that is not the case here as we see an inverse relationship.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

For age, there were a few outliers, as anyone above 80 is not really likely to still be riding; however, they make up an insignificant amount of the data, so this may be expected. It is also seen that younger people ride for longer hours than the older age groups and travel more distance, which is also expected. Across the categorical variables, there is a reverse relationship between the count and the level of utility: despite there being more male riders, females appear to travel longer distances. The same is seen for user type, where there are more subscribers, but customers seem to have longer travel times and distances.

Multivariate Exploration

Questions of interest:

Here we can see that Thursday is the day with the most rides, just as the univariate plot suggested.

How does the average trip duration vary by user type, gender, and age group?

We have seen in the bivariate plots that the female and other genders tend to have higher average trip durations than the male gender, and that customers also have a higher average trip duration than subscribers. This is seen clearly here.
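Average durations by group, as discussed here, can be computed with a `groupby`. The rows below are made up, so the numbers are illustrative only and do not reproduce the dataset's averages.

```python
import pandas as pd

# Made-up rides covering each user type and two of the gender levels.
rides = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Customer", "Customer"],
    "member_gender": ["Male", "Female", "Male", "Female"],
    "duration_mins": [8.0, 10.0, 20.0, 30.0],
})

# Mean duration per user type.
mean_by_user = rides.groupby("user_type")["duration_mins"].mean()

# Mean duration per (user type, gender), pivoted so genders become columns.
mean_by_user_gender = (
    rides.groupby(["user_type", "member_gender"])["duration_mins"].mean().unstack()
)
```

The pivoted table maps directly onto a clustered bar chart with one cluster per user type and one bar per gender.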

We have seen in the univariate plots that duration generally falls below 200mins; this is visible in the heatmap, and the rides fall generally across three major latitude/longitude locations.

The clustered bar chart here illustrates the number of trips per user type. It still highlights that more trips occur on weekdays (Thursday being the highest) and that subscribers take more trips than customers.
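The counts behind a clustered bar chart like this one can be tabulated with `pd.crosstab`. The rows below are made up for illustration and do not reflect the real trip counts.

```python
import pandas as pd

# Made-up rides: three on a Thursday and two on a Saturday.
rides = pd.DataFrame({
    "day_of_week": ["Thursday", "Thursday", "Thursday", "Saturday", "Saturday"],
    "user_type": ["Subscriber", "Subscriber", "Customer", "Subscriber", "Customer"],
})

# Rows are days, columns are user types, cells are trip counts.
counts = pd.crosstab(rides["day_of_week"], rides["user_type"])

# Total trips per day, summed across user types.
trips_per_day = counts.sum(axis=1)
```

Each row of `counts` corresponds to one cluster of bars, and each column to one bar colour.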

The map shows the data is clustered around three major stations on the map

The chart above shows the distribution of distance travelled and duration by member gender. While most riders in every group tend to travel shorter distances, male riders have the highest duration and distance values, and female riders are also more widely spread in duration than in distance.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I investigated the relationships between trip duration, member_gender and user_type further. The female and other gender categories have higher average trip durations despite lower counts in the univariate plots, more trips happen on weekdays than on weekends, and there are three major clusters of station locations based on longitude and latitude. The duration travelled decreases with increasing age, which is expected.

Were there any interesting or surprising interactions between features?

It was interesting to find an inverse relationship between duration travelled and distance travelled for males and females. It was also surprising to see that distance and duration had an inverse relationship in the data, which could point to errors in the data collection process or require more investigation.

Conclusions

In summary, during the exploration of this dataset, I performed preliminary data wrangling, dropped rows with null values, changed the date columns to proper datetime formats, and did a bit of feature engineering to create new columns for day of the week and hour of day. I also used the latitude and longitude columns to create a new column for distance travelled in km. Using the member birth year, the age and age group columns were engineered to understand the distribution of the data by age group.

Visualizations were first done to explore each feature (univariate). While investigating the duration and distance columns, I discovered outliers that needed to be dealt with and applied log transformations to understand the distribution of the data. The bivariate plots showed an inverse relationship between duration and distance covered, and a reversal of the univariate results: the male gender had a higher count of rides but, surprisingly, a lower average trip duration. The multivariate plots explored and expanded on these findings.

In conclusion, the Ford GoBike data exploration showed that the data collection process needs more investigation: it reveals an inverse relationship between duration and distance, and longer trip durations for female riders despite their having fewer rides, which seems inconsistent with reality.